
HDDS-15066. Read-Write Lock race leave stale references to container creating orphan replicas#10109

Open
Gargi-jais11 wants to merge 9 commits into apache:master from Gargi-jais11:HDDS-15066

Conversation

@Gargi-jais11
Contributor

What changes were proposed in this pull request?

On a datanode, some work runs under a container read lock (or otherwise changes which replica directory / in-memory Container is authoritative) while other threads look up a container once and later take a write lock. If the in-memory container mapping or on-disk location changes in between, the second thread can still use a stale Container or KeyValueContainerData reference. That is a classic TOCTOU problem: the map and the caller disagree about where the replica lives.

Worst cases include wrong ContainerSet updates, deleting or updating the wrong paths, ghost / orphan replica data on disk that SCM no longer tracks, and block deletion targeting the wrong RocksDB/chunks tree so pending deletes on the live replica are never applied (space not reclaimed).

This applies to CLOSED and QUASI_CLOSED containers (and any path where balancing/replication overlaps lifecycle commands), not to a single state.

This is easiest to explain with DiskBalancer as an example, although it applies to other read/write lock races as well:
While DiskBalancer moves a container between volumes on the same datanode, it holds the container read lock, copies data, then calls ContainerSet.updateContainer(...) to point the in-memory map at a new Container instance (new volume / paths).
Meanwhile, other threads often look up the container once, then block on writeLock() until the move finishes. After they unblock, they still hold a reference to the old container (source volume). They then run their logic and/or removeContainer(containerId) using stale paths and a stale object identity.
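A minimal sketch of the racy pattern (simplified; not the exact datanode code):

// Service thread: look up once, lock later.
Container<?> container = containerSet.getContainer(containerId);  // reference to the Disk1 replica

// ... DiskBalancer meanwhile copies the replica to Disk2 under its read lock
//     and calls containerSet.updateContainer(newContainer) ...

container.writeLock();   // blocks until DiskBalancer releases the read lock on the OLD object
try {
  // "container" still points at the Disk1 replica, but the map now points at Disk2.
  // Any path, RocksDB, or ContainerSet update done here acts on the stale object.
} finally {
  container.writeUnlock();
}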

Impacts:

  1. Replication Manager DeleteContainerCommand: can remove the wrong map entry (live replica on the destination volume), delete/move source files, and leave an orphan replica on disk that SCM no longer sees — plus wrong volume accounting and ICRs keyed off the old replica.
  2. BlockDeletingTask: uses a task-scoped KeyValueContainerData snapshot (source paths) while getContainer() may return the new replica; block cleanup can target the wrong chunks path and never clear pending-deletion state on the live replica → space not reclaimed.

Suggested Fix:

For RM side: After acquiring writeLock(), re-fetch from the map and compare by identity. If they differ, the container was moved — abort and let the caller retry.

container.writeLock();
try {
  Container<?> current = containerSet.getContainer(
      container.getContainerData().getContainerID());
  if (current != container) {
    // Container was relocated by DiskBalancer while we waited for the lock.
    // Our reference is stale; the operation will be retried on the next cycle.
    return;
  }
  // ... proceed with the actual operation
} finally {
  container.writeUnlock();
}

For BlockDeletingService:

container.writeLock();
try {
  // Re-fetch AFTER acquiring the lock: in the race scenario the map already
  // points at the disk-B replica, so current.getContainerData() is disk-B's
  // ContainerData and differs from this.containerData (the disk-A snapshot).
  Container<?> current = containerSet.getContainer(containerData.getContainerID());
  if (current == null || current.getContainerData() != containerData) {
    // containerData is stale — DiskBalancer relocated this container.
    // BlockDeletingService will reschedule with disk-B's fresh containerData.
    return;
  }
  // ... proceed with block deletion against the verified containerData
} finally {
  container.writeUnlock();
}

For DiskBalancer:
Move the container state check to after readLock() is acquired, so the state is not read from a stale reference.

container.readLock();
try {
  // Re-check the container state after acquiring the lock:
  // it may have changed since the container was selected for the move.
  State containerState = container.getContainerData().getState();
  if (!movableContainerStates.contains(containerState)) {
    LOG.warn("Container {} is in {} state, skipping move process.", containerId, containerState);
    postCall(false, startTime);
    return BackgroundTaskResult.EmptyTaskResult.newResult();
  }
  // Step 1: Copy container to new Volume's tmp Dir
  diskBalancerTmpDir = getDiskBalancerTmpDir(destVolume)
      .resolve(String.valueOf(containerId));
  ozoneContainer.getController().copyContainer(containerData, diskBalancerTmpDir);
  // ...

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15066

How was this patch tested?

Added unit tests on the DiskBalancer side as an example of the race condition, which can happen with any service's read/write lock race.
Test file: TestDiskBalancerWithConcurrentBackgroundTasks

@Gargi-jais11 Gargi-jais11 marked this pull request as ready for review April 23, 2026 05:54
@Gargi-jais11
Contributor Author

@ChenSammi @sadanand48 Please review the patch at your convenience.

Contributor

@adoroszlai adoroszlai left a comment


Thanks @Gargi-jais11 for the patch.

Contributor

@adoroszlai adoroszlai left a comment


Thanks @Gargi-jais11 for updating the patch.

@Gargi-jais11
Contributor Author

@sodonnel Please review this PR whenever you are free.

@sodonnel
Contributor

This change looks good and I think it will solve the problem. However I think it reveals a problem with the code structure within the datanode. These services should not have to perform such complex locking to get a consistent view of a container and it would be very easy in the future for some other service to come along and not do things correctly, or indeed some other existing part of the DN code may also be doing things incorrectly already.

I think that all the container "logic" should be hidden behind an interface and all the services like disk balancer, or block deletion should call it, eg:

containerManager.moveContainer(container_id, destination)
containerManager.deleteBlockList(container_id, blocksToDelete)
...
etc

Then all the locking etc happens behind the scenes in the same place where it can be controlled more closely.
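For illustration, a rough sketch of what that facade could look like (method names taken from the pseudocode above; the signatures are illustrative, not existing Ozone APIs):

// Hypothetical facade: services never hold Container references or take locks themselves.
public interface ContainerManager {
  // Moves a container to the destination volume; the copy, the ContainerSet update
  // and all locking happen inside this call.
  void moveContainer(long containerId, HddsVolume destination) throws IOException;

  // Deletes the given blocks; the container is resolved by ID under the
  // appropriate lock on every invocation, so no stale reference can leak out.
  void deleteBlockList(long containerId, List<BlockID> blocksToDelete) throws IOException;
}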

Fixing this is a large exercise and not something we would want to take on in this PR.

There are other places in KeyValueHandler that take the container lock - are we sure they are OK or is any change needed in them too?

@Gargi-jais11
Contributor Author

Gargi-jais11 commented Apr 29, 2026

This change looks good and I think it will solve the problem. However I think it reveals a problem with the code structure within the datanode. These services should not have to perform such complex locking to get a consistent view of a container and it would be very easy in the future for some other service to come along and not do things correctly, or indeed some other existing part of the DN code may also be doing things incorrectly already.

I think that all the container "logic" should be hidden behind an interface and all the services like disk balancer, or block deletion should call it, eg:

containerManager.moveContainer(container_id, destination)
containerManager.deleteBlockList(container_id, blocksToDelete)
...
etc

Then all the locking etc happens behind the scenes in the same place where it can be controlled more closely.

Fixing this is a large exercise and not something we would want to take on in this PR.

There are other places in KeyValueHandler that take the container lock - are we sure they are OK or is any change needed in them too?

I agree with you @sodonnel that all locking mechanisms should be hidden behind an interface so that locking happens in one place.
I was going through the other write locks in the code base. I see that when EC under-replication and DiskBalancer overlap there can be a failure; it is not a correctness problem, but we can minimise the wasted effort here.
Setup: Container C (RS-6-3), shard index=2 exists only on DN1 (under-replicated). DiskBalancer on DN1 simultaneously moves shard idx=2 from Disk1 → Disk2.

The race:

DiskBalancer (DN1)                    EC Reconstruction (DN5 coordinator)
─────────────────────────────────     ────────────────────────────────────────
container.readLock()
  copy Disk1 → tmp/                   pulls idx=0 from DN3 ✓
  atomic move tmp → Disk2             pulls idx=1 from DN4 ✓
  containerSet.updateContainer()      containerSet.getContainer(C) → OLD obj (Disk1)
  readUnlock()                        tries writeLock → was blocked, now unblocked
  markContainerForDelete(OLD)         OLD.state = DELETED
                                      exportContainerData checks state → DELETED
                                      throws IllegalStateException → FAILS
                                      ← entire ReconstructECContainersCommand fails
                                      bandwidth for idx=0, idx=1 already wasted

Net result: EC reconstruction fails just because the shard at DN1 was marked DELETED, even though the indexes from the other DNs were pulled correctly. Container C still has shard idx=2 missing. RM re-queues it on the next monitor cycle.

Here we have two options:
Option 1 — Rely on RM to re-send (current behavior)

  • RM re-queues the container after command failure/timeout
  • Next cycle: DN1's heartbeat has already reported the new path for idx=2 on Disk2
  • RM sends a fresh ReconstructECContainersCommand — this time export succeeds on the NEW container
  • Cost: 1 extra monitor cycle delay + bandwidth for idx=0, idx=1 already wasted in the failed attempt
  • Safe? Yes — no data loss, just delayed fix. But every failed attempt wastes bandwidth for all N-1 other shards too.

Option 2 — Re-fetch container from ContainerSet after getting the lock like others
In exportContainerData, after acquiring the lock, re-fetch the container from ContainerSet by ID, so even if the old container is marked DELETED the new container moved by diskbalancer will be fetched and will not fail ECReconstruction.

Please let me know which option you both prefer, @sodonnel @ChenSammi. RM retry is simpler and already works — it just costs one extra cycle and wasted network I/O every time the race window is hit. I just wanted to bring this to your attention and discuss.

With Ratis under-replication and DiskBalancer working on the same container there is no issue, because replication pulls from only one replica: if the container state is already DELETED the attempt fails and another replica is used, and if under-replication acquires the lock before the state changes to DELETED it succeeds. But for EC, the entire operation must be repeated just because of one shard, as shown above.

@Gargi-jais11
Contributor Author

Second Finding: for a QUASI_CLOSED container, SCM sends a force-close command while DiskBalancer is working on the same container.
QUASI_CLOSED containers can be force-closed by SCM (CloseContainerCommand with force=true). That goes through controller.closeContainer().

Here is the exact timeline:

DiskBalancer (DN1)                                 CloseContainerCommandHandler
─────────────────────────────────────              ─────────────────────────────────────────
T1: container = containerSet.getContainer(C)
    → OLD container (QUASI_CLOSED, Disk1)

T2: container.readLock() on OLD

T3: copy Disk1 → Disk2 ...                         T3a: container = containerSet.getContainer(C)
                                                        → OLD container (before updateContainer)
                                                   T3b: switch(container.getContainerState())
                                                        → QUASI_CLOSED + force=true
                                                        → controller.closeContainer(id)
                                                          → containerSet.getContainer(id)
                                                          → OLD container (still, before T5)
                                                          → container.close()
                                                          → writeLock() → BLOCKED (readLock held)

T4: copy done, atomic move to Disk2
T5: importContainer → newContainer (QUASI_CLOSED, Disk2)
T6: containerSet.updateContainer(newContainer)
    ← ContainerSet now maps C → newContainer
T7: container.readUnlock() ← releases OLD readLock
                                                   T7a: writeLock ACQUIRED on OLD container
                                                        → OLD: QUASI_CLOSED → CLOSED
                                                        → sendICR(OLD=CLOSED) → SCM told C is CLOSED

T8: container.markContainerForDelete(OLD)
    → writeLock → OLD: CLOSED → DELETED

After T8:

                           State           In ContainerSet?
--------------------------------------------------------------------------
OLD container (Disk1)      DELETED         No (updateContainer removed it)
NEW container (Disk2)      QUASI_CLOSED    Yes — this is the live replica

SCM's view: Container C on DN1 = CLOSED (from ICR sent at T7a), Reality: Container C on DN1 = QUASI_CLOSED (newContainer).

This is a kind of regression:
SCM thinks it's CLOSED. But DN1's next container report says QUASI_CLOSED. SCM sees a state "regression" (CLOSED → QUASI_CLOSED). Depending on the FCR sent to SCM, it may:

Re-send a force close command → controller.closeContainer(id) now re-fetches from ContainerSet → gets NEW container → closes it correctly → CLOSED. Eventually converges.
Or treat it as an unhealthy/inconsistent replica.
No data loss — the data is intact on Disk2. But there is a state inconsistency window where SCM's cached state (CLOSED) differs from reality (QUASI_CLOSED on the new disk).

I think here as well we need to re-fetch the container after acquiring the write lock.
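A sketch of that re-fetch, following the same identity check as the RM-side fix above (simplified; where exactly it lands in the close path is an open question):

container.writeLock();
try {
  Container<?> current = containerSet.getContainer(containerId);
  if (current != container) {
    // DiskBalancer swapped in a new Container while we waited for the lock.
    // Abort; SCM will re-send the close and the handler will then pick up the
    // live (Disk2) replica from the ContainerSet.
    return;
  }
  // safe to transition QUASI_CLOSED -> CLOSED and send the ICR for this object
} finally {
  container.writeUnlock();
}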

@Gargi-jais11
Contributor Author

Third finding: markContainerUnhealthy with DiskBalancer working in parallel. I believe this also needs to re-fetch the container after writeLock().
Here it applies to any container state — CLOSED, QUASI_CLOSED, OPEN, RECOVERING — all can become UNHEALTHY.

Case 1: Container CLOSED/QUASI_CLOSED + DiskBalancer + Scanner in parallel
DiskBalancer (DN1)                          Container Scanner (DN1)
──────────────────────────────              ──────────────────────────────────────────
T1: selects C (CLOSED, Disk1)
    added to inProgressContainers

T2: container.readLock() on OLD

T3: copy Disk1 → Disk2                      scanner reads OLD files on Disk1
    (I/O in progress)                       finds checksum failure
                                            controller.markContainerUnhealthy(id, reason)
                                              containerSet.getContainer(id)
                                              → OLD container (Disk1)  ← stale ref
                                              handler.markContainerUnhealthy(OLD, reason)
                                              → writeLock() → BLOCKED (readLock held)

T4: copy done, checksum verified
T5: importContainer → NEW (CLOSED, Disk2)
T6: containerSet.updateContainer(NEW)
    ← ContainerSet maps C → NEW (Disk2)

T7: readUnlock() on OLD

                                            writeLock ACQUIRED on OLD
                                            state = CLOSED, not UNHEALTHY, volume not failed
                                            OLD: CLOSED → UNHEALTHY   ← wrong container
                                            writeUnlock
                                            sendICR: C on DN1 = UNHEALTHY  ← stale, wrong

T8: markContainerForDelete(OLD) → DELETED

Final state:
  OLD (Disk1): DELETED
  NEW (Disk2): CLOSED  ← healthy, valid
  SCM view of container C: C on DN1 = UNHEALTHY  ← wrong

What SCM/RM does in response:

T9:  RM next cycle — ECUnderReplicationHandler.checkAndRemoveUnhealthyReplica()
       SCM replica record: DN1 has UNHEALTHY replica of C
       checks: is there a CLOSED replica for same index on another DN? 
         → if YES: "prefer deleting the UNHEALTHY over CLOSED" → sendThrottledDeleteCommand(DN1)
         → if NO CLOSED elsewhere: "delete any UNHEALTHY" → sendThrottledDeleteCommand(DN1)
T10: DN1 receives DeleteContainerCommand for C
       containerSet.getContainer(id) → NEW container (Disk2, CLOSED)
       NEW container DELETED  ← healthy valid replica gone
       
     Now container C is genuinely under-replicated.
     RM tries to fix it by replicating — but the replica it just deleted was the source.

Outcome:  A healthy replica on Disk2 gets deleted. Container C becomes genuinely under-replicated.
The window between readUnlock (T7) and markContainerForDelete (T8) is the critical period. If the scanner's sendICR reaches SCM and RM processes it before FCR corrects the state, the delete command lands on DN1 and hits the healthy NEW container. This is why markContainerUnhealthy needs the re-fetch — the consequences of operating on the wrong container are irreversible.

@Gargi-jais11
Contributor Author

@sodonnel @ChenSammi the above are additional areas that need the same fix. The rest do not have any issue like this; I have done a thorough pass over the code.

@sodonnel
Contributor

sodonnel commented Apr 29, 2026

In exportContainerData, after acquiring the lock, re-fetch the container from ContainerSet by ID, so even if the old container is marked DELETED the new container moved by diskbalancer will be fetched and will not fail ECReconstruction.

For EC reconstruction I don't think it exports the entire container like this. Rather, it reads the blocks out of the container block by block using the normal read path through the datanode. So the question is: can the normal read path be impacted by the balancer moving a container from disk 1 to disk 2? Ideally, the DN should be safe to do this using its locking, as it is not ideal that reads would fail randomly. If it's Ratis, it can just try another replica with a small delay. With EC it would fall back to reconstruction reads, which are slower and use more resources.

@sodonnel
Contributor

With this problem it is getting difficult to see whether we have solved all the cases or not.

The root cause is poor interface design, and further inadequate locking. Fixing this is going to be difficult.

The locking is only inadequate because the lock is inside the container object and the container object can be replaced by a new object against the same ID.

What are all the scenarios that replace that container object? What if we instead updated the existing object in place under a lock? Would that solve all these edge cases more simply? It sounds like these problems all come from the disk balancer moving and replacing the container object, so if we fix just that, it may be a cleaner solution to this whole problem?

@sodonnel
Contributor

Looking at the containerSet class, the updateContainer() method was added only for Diskbalancer and is only called by disk balancer. The only other place items are added to the containerMap is in the addContainer() method, so it seems like this could all be solved by updating the Container object fields rather than creating a new instance and replacing it. It doesn't appear to be an immutable object so this would be a much smaller fix and I think will solve all the issues?

@Gargi-jais11
Contributor Author

For EC reconstruction I don't think it exports the entire container like this. Rather, it reads the blocks out of the container block by block using the normal read path through the datanode.

@sodonnel My bad, I got diverted while exploring. You are correct: for EC reconstruction, it reads chunk by chunk per block, over gRPC, using GetBlock + ReadChunk RPCs. The data flows through the source DN's normal read handler (HddsDispatcher → KeyValueHandler.handleReadChunk), which takes a read lock per chunk, not a write lock.

So the question is - can the normal read path be impacted by the balancer moving a container from disk 1 to disk 2?

  • DiskBalancer holds readLock during the copy. A concurrent ReadChunk or GetBlock call for the same container also needs readLock — multiple readers allowed simultaneously. So they don't block each other. Reads are safe during DB's copy phase.
  • After updateContainer, the ContainerSet maps to the new object. The read handler calls containerSet.getContainer(id) fresh on every RPC — it gets the NEW object pointing to Disk2. Reads proceed on Disk2. Safe after the swap.
  • The only window: if the gRPC channel had fetched the container reference before updateContainer, it still reads from Disk1 files via the OLD object. Since a read lock doesn't conflict with another read lock, reads work fine. Disk1 files still exist until replicaDeletionDelay. Safe, but there can still be an issue if EC reconstruction has not completed from the old container: that replica is deleted eventually, about 5 minutes after the move to the new disk, and after that the reads for EC reconstruction will fail.

What scenarios replace the container object in ContainerSet?

What I can see from the codebase is :

addContainer()         → called at DN startup (ContainerReader), 
                         incoming replication (importContainer),
                         EC reconstruction target creation
updateContainer()      → called ONLY by DiskBalancerService (line 557)

That's it. updateContainer for diskBalancer is the only production path that swaps the live container object.

@Gargi-jais11
Contributor Author

Gargi-jais11 commented Apr 30, 2026

What if we instead updated the existing object in place under a lock? Would that solve all these edge cases more simply? It sounds like these problems all come from the disk balancer moving and replacing the container object, so if we fix just that, it may be a cleaner solution to this whole problem?

@sodonnel I believe your suggestion is correct and cleaner. Updating the fields in place wipes out the path to Disk1. But there are many things which need to be updated if the existing object is modified, and doing that is not an easy process since these fields are currently considered immutable. So this will be our challenge point, and it can introduce new edge cases for the balancer.

To resolve this, what I can think of is to save the old paths in local variables before acquiring the write lock and updating the existing container object. The lock only needs to protect the swap of the live in-memory fields; path cleanup uses the saved locals.

According to this, the proposed new DiskBalancer flow will look something like this:

Phase A: File-level copy (identical to today — no change here)
  A1. container.readLock()
  A2. copyContainer(Disk1 → tmp/)                      <-------------------- files copied
  A3. atomic Files.move(tmp → Disk2/final)          <-------------------- .container YAML on Disk2 is final

Phase B: In-place update (replaces current updateContainer call)
  B1. Save: oldMeta = containerData.getMetadataPath()
            oldChunks = containerData.getChunksPath()
            oldVolume = containerData.getVolume()
            oldDbFile  = containerData.getDbFile()
  B2. container.readUnlock()
  B3. container.writeLock()                          <-------------------- escalate, blocks all concurrent readers/writers
  B4. containerData.setVolume(destVolume)
      containerData.setMetadataPath(newMetadataPath)
      containerData.setChunksPath(newChunksPath)
      containerData.setDbFile(newDbFile)
  B5. container.writeUnlock()

Phase C: Disk1 cleanup (uses saved locals — no reference to Disk1 survives in the container object)
  C1. update volume space: destVolume.incrementUsed; oldVolume.decrementUsed
  C2. markContainerForDelete on Disk1 using oldMeta/oldChunks/oldDbFile paths
  C3. delete Disk1 files
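A rough Java sketch of Phases B and C, assuming setters such as setVolume()/setMetadataPath() exist or would be added (today these fields are treated as immutable after construction):

// Phase B: save the Disk1 locations, then swap the live fields under the write lock.
String oldMetadataPath = containerData.getMetadataPath();
String oldChunksPath = containerData.getChunksPath();
HddsVolume oldVolume = containerData.getVolume();
File oldDbFile = containerData.getDbFile();

container.readUnlock();
container.writeLock();   // readers between the unlock and this point still see valid Disk1 paths
try {
  containerData.setVolume(destVolume);            // hypothetical setters, see B4 above
  containerData.setMetadataPath(newMetadataPath);
  containerData.setChunksPath(newChunksPath);
  containerData.setDbFile(newDbFile);
} finally {
  container.writeUnlock();
}

// Phase C: Disk1 cleanup uses only the saved locals; the container object no longer
// references Disk1, so no later lookup can hand out the old paths.
// C1: volume accounting (destVolume.incrementUsed / oldVolume.decrementUsed)
// C2/C3: mark for delete and remove the Disk1 replica via oldMetadataPath,
//        oldChunksPath and oldDbFile.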

This change brings a new challenge: we need to verify that no data loss occurs and no orphan container is created.

cc: @ChenSammi
Please go through these steps and comment if any issue would be there as part of diskbalancer if we update existing object.

@ChenSammi
Contributor

ChenSammi commented Apr 30, 2026

@sodonnel, since the container is moved from one volume to another, there are multiple pieces of container information that need to be updated. To achieve this, replacing the fields in place in the container, or replacing the container reference in the containerMap, are the two solutions we can choose from. Generally we should choose the one with less complexity and less cost. If replacing the fields in place, there are at least 4 field swaps, one by one. If replacing the container reference, there is one swap. Basically, hot-swapping one pointer has a less complex result than hot-swapping four pointers, right?

@Gargi-jais11, you can investigate all the existing datanode code places where MetadataPath, ChunksPath, Volume info, or RocksDB are needed for the work, and what the impact is if that info is swapped in between. Currently these fields are considered unchanging (immutable) once the ContainerData object is created and put into the containerMap. I have no idea how many edge cases there will be.

@Gargi-jais11
Contributor Author

Yes, correct @ChenSammi. I am working through the other parts of the datanode to analyse where issues due to staleness can happen, and then we can update them in batches if there are many. I will post the complete analysis of the DN here once I am done; I need some time.
